The investigation focuses on discerning patterns in contributing factors influencing traffic accidents in NYC. Initially, we aim to identify prevalent factors contributing to these incidents. Subsequently, an exploration ensues to ascertain potential associations between these contributing factors and variables such as Visibility, Wind’s Direction, and Wind’s Speed. This analytic approach aims to unveil underlying relationships and enhance our understanding of the multifaceted dynamics that influence traffic accidents in the city. By examining the interplay of contributing factors with weather-related variables, we seek to provide valuable insights that contribute to a comprehensive understanding of the factors influencing road safety in the urban context.
In the association heat map, two variables stand out as peculiar: “Turning_Loop” and “Roundabout.” Both variables exhibit only one value (False in this case). Consequently, we can safely exclude these two variables from our subsequent analysis, as their singular values do not contribute meaningfully to our research outcomes. This strategic decision allows us to focus on the variables that offer more variation and potential impact on our results.
Moreover, a quantitative analysis through a graph provides clearer insights. Notably, the variable “Crossing” exhibits a strong correlation with “Traffic_Signal,” which is logical given the intricate scenarios at intersections where drivers must adhere to traffic signals to prevent accidents. Additionally, “Traffic_Calming” and “Bump” show some association, as the presence of additional traffic calming measures increases the likelihood of encountering a bump, whether it involves interactions between vehicles or with these barriers. On the contrary, several factors, such as “Give_Away,” “Junction,” and “No_Exit,” lack a discernible association with others.
Given the huge data amount of our dataset, we have opted to visualize only the top-3 contributing factors using the ggalluvium plot. In conjunction with the previously presented graphs, it becomes evident that to mitigate traffic accidents effectively, governmental focus and resources should be directed towards enhancing the safety and maintenance of crossing, junction, and traffic signal infrastructure across all listed areas in NYC. These factors not only significantly contribute to accidents but also exhibit interconnectedness, as seen in the relationship between crossing and traffic signals.
Quantitative Factors
We would like to further research on some other quantitative factors, to see if there is some distribution difference in different logic factors.
While there may be an initial inclination to attribute accidents to visibility issues, our ridge density graphs reveal that the top nine common factor labels in traffic accidents do not exhibit discernible differences in visibility distribution between True and False cases.
Code
tem <- q_reason[, c("Temperature.F.", top_9)]long_tem <-gather(tem, key ="Factors", value ="Value", 2:10)ggplot(long_tem, aes(y = Value, x = Temperature.F.)) +geom_density_ridges() +facet_wrap(~ Factors, scales ="free_y") +theme_minimal() +labs(title ="Ridge graph of Temperature by Factors")
However, we observe variations concerning temperature. Specifically, we note that temperatures tend to concentrate in two extremes, indicating that both cold and hot weather conditions may have an impact on road conditions and drivers’ concentration, which lead to a traffic accident associated with this factor label.
3.2 Temporal Trends:
Since our dataset contains traffic accidents in NYC from 2016 to 2023, we are willing to analyze the temporal trends for the major factor labels and some other characters in our dataset.
quant_data <- data[, c('Start_Time', 'County', names(column_sums))]quant_data <- quant_data %>%filter(!is.na(ymd_hm(Start_Time)))quant_data <- quant_data %>%mutate(Year = lubridate::year(Start_Time),Month = lubridate::month(Start_Time, label =TRUE, abbr =TRUE))temp_quant <- quant_data %>%group_by(County, Year, Month) %>%summarise(Count =n())temp_quant$Month <-factor(temp_quant$Month, levels = month.abb)ggplot(temp_quant, aes(x = Month, y = Count, fill =factor(County))) +geom_bar(stat ="identity", position ="dodge") +facet_wrap(~ County, scales ="free_y") +labs(title ="Number of Accidents by Month Sum All Year")
We find an interesting insight, that for most of area (or except Bronx), there is a count drop from Oct. to next year’s April. Assume the data has no bias when collecting, we could know that the accidents mostly happen in summer. Since NYC is a modern city, we do not find a trend that the cold weather would affect the road situation, leading extra traffic accidents.
We processed our data for further use in d3 graph. What we did was to visualize the trend of count for each reason over month. Thus, we are able to observe how specific reasons that cause traffic accidents change over time.
Examining the graph, we observe that “Crossing” consistently maintains its top rank from 2016 to 2022, establishing itself as the primary location for traffic accidents in NYC. Although the data collection extends only until March 2023, it is plausible to anticipate that “Crossing” will retain its leading position for the entire year.
Despite this consistency in the top-ranking location, there is a lack of significant temporal changes for major factor labels throughout these years. This suggests that the data collection process for traffic accidents is robust and provides a stable statistical foundation. Furthermore, this stability underscores the need for continued vigilance by both the government and the community, aligning their efforts with the ranking depicted in the graph to effectively address and mitigate the significant impact of traffic accidents.
3.3 Spatial Variation in Traffic Accident:
Temporal Trend for Spatial Data
We now will step into the spatial character for our dataset. However, for a fair comparison, we need to check the temporal trend for the amount of traffic accidents in differen area.
Code
month_counts <- quant_data %>%group_by(County, Year) %>%summarise(UniqueMonths =n_distinct(Month))year_data <- quant_data %>%group_by(County, Year) %>%summarise(Count =n())avg_data <- year_data %>%left_join(month_counts, by =c("County", "Year")) %>%mutate(AvgCount = Count / UniqueMonths)ggplot(avg_data, aes(x = Year, y = AvgCount, fill = County)) +geom_bar(stat ="identity", position ="dodge", color ="black") +labs(title ="Average Count per Month for Each County",x ="Year",y ="Average Accident Count per Month") +theme_minimal() +facet_wrap(~ County, scales ="free_y")
Analyzing the graph, we observe a consistent upward trend in the monthly averaged number of traffic accidents from 2016 to 2022. Particularly noteworthy is a sudden spike in 2022, resulting in a significant increase in traffic accident occurrences and subsequent societal losses. In Queens, a slight decrease is evident in 2018, but attention should be directed to Richmond, where the order of accident numbers deviates from the overall trend, as previously indicated in the graph.
Richmond presents a peculiar pattern from 2017 to 2021, resembling a “U” shape and resulting in a bimodal distribution. This anomaly is tentatively attributed to potential issues with data collection. Despite utilizing monthly averages for each year, there is a notable decrease in 2023. It’s crucial to consider that our data collection only extends until March of this year, and as we’ve established earlier, traffic accidents are less frequent from October to April of the following year.
Severity Analysis in Different Area
We would also like to analysis the Severity distribution and their potential cause in 5 different areas.
Code
sever_data <- data[, c(3, 14, 30:42)]sever_data <- sever_data %>%mutate(TrueCount =rowSums(select(., 3:15) ==TRUE))heatmap_data <- sever_data %>%group_by(County, Severity, TrueCount) %>%summarise(Count =n())ggplot(heatmap_data, aes(x = TrueCount, y = Severity, fill = Count)) +geom_tile() +scale_fill_viridis_c() +facet_wrap(~ County) +labs(title ="Heatmap of Severity vs Factors by County",x ="Factors Count",y ="Severity",fill ="Count")
Analyzing the heatmap, we observe a consistent pattern in the distribution of factor labels across different severities. In all areas, the majority of traffic accidents occur with 0-1 factor labels and severity levels 2-3. This finding reinforces the notion that attributing the severity of traffic accidents to specific areas might not be a significant factor. As a result, our primary focus should be on understanding and addressing the spatial density distribution of traffic accidents in NYC.
3.4 Comparative Analysis of Boroughs:
Next, we will examine spatial distribution in accident density across different boroughs in NYC in our dataset (Bronx, Queens, Kings, New York and Richmond).
Code
# Data processing# fetch GeoJson dataurl <-"https://services5.arcgis.com/GfwWNkhOj9bNBqoJ/arcgis/rest/services/NYC_Census_Tracts_for_2020_US_Census/FeatureServer/0/query?where=1=1&outFields=*&outSR=4326&f=pgeojson"geojson <- rjson::fromJSON(file=url)loc <-c()for (i in1:2325){ loc[i] = geojson$features[[i]]$properties$GEOID}# df <- data.frame(matrix(ncol = 3, nrow = 0))# colnames(df) <- c("GEOID", "count", "area")# for (i in 1:2325){# df[i, ] = list(loc[i], 0, geojson$features[[i]]$properties$Shape__Area)# }# # Convert lat&lng to Geoid and save the accident counts in df# for (i in seq(1, length(data$ID))) {# res = GET("https://geo.fcc.gov/api/census/area",# query = list(lat = data[i,]$Start_Lat, lon = data[i,]$Start_Lng, censusYear=2020,format="json"))# # temp = substr(fromJSON(rawToChar(res$content))$results$block_fips[1],1,11)# df[df["GEOID"] == temp,]["count"] = df[df["GEOID"] == temp,]["count"] + 1# }# # df$density <- df$count / df$area# write.csv(df,file="preprocess/density.csv")